1 Credits

This tutorial uses materials that was created by the CANSSI Collaborative Research Team led by Vianey Leos Barajas and Marie Auger-Méthé. I thank Fanny Dupont, Ron Togunov, Natasha Klappstein, Arturo Esquivel, Marco Gallegos Herrada, and Sofia Ruiz Suarez for their contribution to this original material.

2 Tutorial goals

The goal of this tutorial is to explore ways to pre-prepare marine movement data. The primary learning objectives are to:

  1. Regularize movement tracks that are irregular using Fastloc GPS data as an example.
  2. Create predicted tracks from error-prone data using Argos data as an example.

3 General setup

First, let’s load the packages that we will need to complete the analyses. Of course you need to have them installed first.

library(momentuHMM) # Package for fitting HMMs, we use it for crawlWrap
library(tidyverse)  # data management
library(terra)
library(tidyterra)
library(ggspatial)  # plot the data
library(sf)         # spatial data processing
library(kableExtra) # produce visually appealing tables
library(geosphere)  # to calculate step lengths from lat/lon locations - just need it installed
library(conicfit)   # needs to be installed for crawlWrap
library(here)       # To help with sourcing
library(adehabitatLT) # setNA

We are assuming that you are working from Madeira-Workshop repository main folder, and using here() to help find the good directory for each files. We will need the functions in the following file utility_functions.R.

source(here("D1-data-prep-ssm",
            "utility_functions.R"))

4 Regularising Fastloc GPS data

4.1 Narwhal movement data

We will analyze a dataset containing movement tracks of three narwhals tagged with Fastloc-GPS tags. The dataset was provided by Dr. Marianne Marcoux (Fisheries and Oceans, Canada). Dr. Marcoux provided the data only for this tutorial, please do not use for other purposes without their consent. Contact: .

For simplicity, we only examine the fastloc-GPS data from one week in August 2017.

Photo by Paul Nicklen
Photo by Paul Nicklen

First, let’s import the raw Fastloc GPS narwhal data and convert the time column to an appropriate date format.

tracks_gps_O <- read.csv("data/tracks_fastloc_gps.csv") %>%
  mutate(time = ymd_hms(time),
         ID = factor(ID))

We can have a first look at the data, to get more familiar with it.

head(tracks_gps_O)
##        ID                time        x       y loc_class
## 1 T172062 2017-08-08 00:00:00 -80.6414 72.2694       GPS
## 2 T172062 2017-08-08 00:20:00 -80.6525 72.2728       GPS
## 3 T172062 2017-08-08 00:42:00 -80.6665 72.2719       GPS
## 4 T172062 2017-08-08 00:52:00 -80.6692 72.2674       GPS
## 5 T172062 2017-08-08 01:09:00 -80.6682 72.2582       GPS
## 6 T172062 2017-08-08 01:19:00 -80.6613 72.2574       GPS

The columns/variables are: - ID: Individual identifier - time: time of location - x: longitude - y: latiture - loc_class: all GPS

The data we obtain is often quite messy with some rows/records missing information and other records duplicated. We can filter records to keep only complete location data using !is.na(x) & !is.na(y). To remove duplicate records (same time, location, and location class), we will use the lag function from dplyr, which will use the value from the previous row so that we can compare to the current row.

tracks_gps_O <- tracks_gps_O %>% 
  # remove missing locations
  filter(!is.na(x) & !is.na(y),
         # remove identical records
         !(time == lag(time) & 
             x == lag(x) & 
             y == lag(y) & 
             loc_class == lag(loc_class)))

Let’s plot the data over the bathymetry (i.e., depth of the ocean) and land layers, to get a sense of where these narwhals are swimming.

land <- st_read(here("D1-data-prep-ssm", "data", "NorthBaffin.shp"))

The raw bathymetry data used to create this raster was taken from the the GEBCO global ocean and land terrain model from https://pressbooks.bccampus.ca/ewemodel/chapter/getting-bathymetry/. Note that land values of the bathymetry layer are 0s.

bathy <- rast(here("D1-data-prep-ssm", "data", "bathy_4_narwhals.tiff"))

Let’s plot the narwhal movement data over the bathymetry and land layers.

ggplot() +
  geom_spatraster(data = bathy) +
  geom_sf(data = land, fill = "beige") +
  geom_spatial_path(data = tracks_gps_O, 
                    aes(x = x, y = y, colour = ID), crs = 4326) +
  coord_sf(expand = FALSE)

As we will see during the week, many analyses require that the locations are at regular time intervals, however Fastloc GPS data is often taken at irregular time intervals, since it depends on when the animal surface. When the data is irregular, there are two key decisions we must make, (1) the temporal resolution to use, and (2) how to address large data gaps.

4.1.1 Selecting a time interval (resolution)

The desired resolution depends on the biological question you are asking, as different behaviours and biological processes occur at different spatial and temporal scales (e.g., seasonal migration, daily movement between foraging and resting grounds, and fine scale foraging decisions), and the resolution of the raw data you have. Generally, higher resolution data is preferred as it has more information. However, it is possible to have too-high of a resolution wherein information from fine-scale variability drowns out the signal from coarse-scale patterns of interest (e.g., seasonal migration). A very coarse rule of thumb, is that you want 3-50 data points per behaviour. For behaviours spanning several hours, that roughly corresponds to a desired resolution between 2 min and 60 min.

First, let’s calculate the time difference between successive records using difftime and lead (compares current row to following row) and place these values in a new column called dt. Note that the time difference is in minutes (units = "mins"). For the last record of each individual (i.e., when ID != lead(ID)), we will set the value to NA.

# Calculate time difference between locations
tracks_gps_O <- tracks_gps_O %>%
  mutate(dt = ifelse(ID == lead(ID), # If next data row is same individual
                     # calculate time difference
                     difftime(lead(time), time, units = "mins"), 
                     NA))

Let’s see what resolutions may be possible in the data by looking at the most frequent time gaps.

# Visualise time differences (all and zoomed)
par(mfrow = c(1, 2))
hist(tracks_gps_O$dt, 1000, main = NA, xlab = "Time difference (min)")
hist(tracks_gps_O$dt, 1000, main = NA, xlab = "Time difference (min)",
     xlim = c(0,100))

# identify the most frequent dt
tracks_gps_O %>% 
  {table(.$dt)} %>% 
  sort(decreasing = TRUE) %>% 
  head()
## 
##  10  11  12  22   9  13 
## 400 145  90  64  63  63

We see that the most frequent time gap is \(10\) min, followed by \(11\), \(12\), \(22\), \(9\) and \(13\) min. We also see the majority of the gaps are \(< 60\) min, however some are in excess of \(600\) min. Finer resolutions will contain more data gaps. For some analyses, frequent and large data gaps can be difficult to handle, especially as the number of missing data points approaches or exceeds the existing data. Let’s examine the potential data structure at different resolutions for the different animals.

We can now use the function p_na (in the script utility_functions.R) to look at the proportion of NAs we would get with 10, 20, 30, and 60 min resolution.

# summarise track dt
tracks_gps_O %>% 
  group_by(ID) %>% 
  summarise(p_NA_10m = p_na(time, 10),     # 10 min 
            p_NA_20m = p_na(time, 20),     # 20 min 
            p_NA_30m = p_na(time, 30),     # 30 min 
            p_NA_60m = p_na(time, 60)) %>% # 60 min
  # return formatted table
  kable(digits = 3, col.names = c("Narwhal ID", paste(c(10,20,30,60), "m"))) %>%
  kable_styling() %>% 
  add_header_above(c("", "Resolution" = 4))
Resolution
Narwhal ID 10 m 20 m 30 m 60 m
T172062 0.486 0.236 0.191 0.152
T172064 0.502 0.203 0.120 0.074
T172066 0.488 0.230 0.185 0.160

Here we see that the \(10\) min interval, around \(50\%\) of the locations would be missing.

This is a lot! In many cases, we may want to be more conservative and use a \(30\) min resolution or \(60\) min resolution.

4.1.2 Handling data gaps

There are several ways to deal with data gaps:

  1. Split tracks
  2. Interpolate locations
  3. Fill the gaps with NAs
  4. Multiple imputation

The method to use will depend on the main analysis you are interested in.

4.1.3 Splitting tracks

One way to account for missing data is to split the track where there are large gaps (i.e., assign each track segment a new individual ID). This strategy is particularly appropriate when you have long enough gaps for which interpolation method are unlikely to perform well. We can split the tracks when the gaps larger than a predetermined threshold.

Here, we will use a function (found in utility_functions.R) to split the tracks. We define the maximum allowable gap (at which point it will start a new segment), as well as the shortest allowable track segment.

These are somewhat arbitrary decisions, and depend on your subsequent choices for regularisation. In this tutorial, we will be interpolating missing locations (within each segment) and so we only want to allow gaps that can reasonably be predicted.

We are using a 30 min resolution, so we allow a 90 minute gap (i.e., we assume we can predict 2 missing locations), and we want each segment to be at least 120 min (i.e., have at least 5 locations) long so that we have enough information about state transitions.

# Use function from utility_function.R to split data at gaps
track_split <- split_at_gap(data = tracks_gps_O, 
                           max_gap = 90, 
                           shortest_track = 120)

The new data has an updated ID column, where with the original ID and track segment number. The original ID is now in ID_old

head(track_split)
##          ID                time        x       y loc_class dt  ID_old
## 1 T172062-1 2017-08-08 00:20:00 -80.6525 72.2728       GPS 22 T172062
## 2 T172062-1 2017-08-08 00:42:00 -80.6665 72.2719       GPS 10 T172062
## 3 T172062-1 2017-08-08 00:52:00 -80.6692 72.2674       GPS 17 T172062
## 4 T172062-1 2017-08-08 01:09:00 -80.6682 72.2582       GPS 10 T172062
## 5 T172062-1 2017-08-08 01:19:00 -80.6613 72.2574       GPS 10 T172062
## 6 T172062-1 2017-08-08 01:29:00 -80.6592 72.2563       GPS 15 T172062

This data is still irregular, but now we have smaller segments split when there are large data gaps. Let’s visualize the different segment.

ggplot() +
  geom_spatraster(data = bathy) +
  geom_sf(data = land, fill = "beige") +
  geom_spatial_path(data = track_split, 
                    aes(x = x, y = y, colour = ID), crs = 4326) +
  coord_sf(expand = FALSE)

Splitting the tracks is often be a first step, before interpolating or other adjustments.

4.1.4 Interpolation (correlated random walk)

Once the track is split, there is often still irregularity within each segments, and we want to interpolate or predict new locations to form a regular grid of observations.

The simplest approach is to use linear interpolation between missing times, but a better option is to predict locations from a continuous-time correlated random walk (CTCRW). momentuHMM contains wrapper functions to interpolate missing locations by fitting a CTCRW to the data based on the crawl package by Devin Johnson and Josh London. There are many options to fit the CTCRW model, and a detailed tutorial for analysis with crawl is available here: https://jmlondon.github.io/crawl-workshop/crawl-practical.html. Let’s try to fit the most basic model using the wrapper function crawlWrap. In the most basic form, we only need to provide tracking data with the columns ID, time, x, and y and specify the desired temporal resolution.

First, let us transform the data into an sf object. crawlWrap can also take a data.frame as an argument but that would imply renaming some of our columns. It is easier to just transform the data into an sf object.

Here it’s important to project the data in a good coordinate system, since the model takes into account speed/step length.

track_split_sf <- track_split %>%
  st_as_sf(coords = c("x", "y")) %>% # converts to an sf object
  st_set_crs(4326) %>% # define CRS
  st_transform(2962) # reproject data to a UTM

Now we can fit the CTCRW to each track segment and create a data frame of predicted locations. We decided above that a good, maybe conservative time interval was 30 min. We will use that interval here. Since interpolating when they are large data gaps can bias our analysis, we use the split tracks.

# crawl can fail to fit periodically, so I recommend always setting a seed 
set.seed(12)

# fit crawl
crwOut <- crawlWrap(obsData = track_split_sf, timeStep = "30 mins", theta = c(7, 0))
## Fitting 20 track(s) using crawl::crwMLE...
## Individual T172062-1...
## Individual T172062-2...
## Individual T172062-3...
## Individual T172062-4...
## Individual T172062-5...
## Individual T172062-6...
## Individual T172062-7...
## Individual T172064-1...
## Individual T172064-2...
## Individual T172064-3...
## Individual T172064-4...
## Individual T172064-5...
## Individual T172064-6...
## Individual T172066-1...
## Individual T172066-2...
## Individual T172066-3...
## Individual T172066-4...
## Individual T172066-5...
## Individual T172066-6...
## Individual T172066-7...
## DONE
## 
## Predicting locations (and uncertainty) at 30 mins time steps for 20 track(s) using crawl::crwPredict...
## DONE

Let’s look at a few track segments with interpolated values.

plot(crwOut, animals = "T172062-3", ask = FALSE)

plot(crwOut, animals = "T172064-5", ask = FALSE)

plot(crwOut, animals = "T172066-4", ask = FALSE)

Notice how the predicted tracks do not make perfectly straight lines through missing sections, which is an improvement on what a simple linear interpolation method would provide.

We can now extract the predicted regular tracks.

# Get predicted tracks from crawl output
track_int <- crwOut$crwPredict[which(crwOut$crwPredict$locType == "p"),
                              c( "ID", "mu.x", "mu.y", "time")]
colnames(track_int) <- c( "ID","x", "y", "time")
head(track_int)
##           ID         x       y                time
## 1  T172062-1 -285113.8 8176359 2017-08-08 00:20:00
## 4  T172062-1 -285813.8 8176130 2017-08-08 00:50:00
## 8  T172062-1 -286046.5 8174867 2017-08-08 01:20:00
## 11 T172062-1 -286045.6 8174408 2017-08-08 01:50:00
## 14 T172062-1 -285394.1 8174699 2017-08-08 02:20:00
## 17 T172062-1 -284706.5 8173150 2017-08-08 02:50:00

Note here that the time is at every 30 min.

4.1.5 Place NAs in (small) gaps

Instead of interpolating, one can leave data gaps as NAs. Some analysis, for example hidden Markov models without spatial covariates extracted from the locations, could easily handle NAs in the locations. The maximum size of a gap to allow depends on the frequency of the missing data, frequency of locations, study species, and behaviours of interest. However, for many analyses that assume regular time interval, it’s important to explicitly include the NAs, so that the function knows that there are missing data they have to account for. Placing NAs for missing locations can be used on its own or in conjunction with splitting tracks. The package adehabitatLR has a function setNA dedicated to it.

We will apply this to the split tracks that we used previously used in the tutorial. Here, instead of using crawl to interpolate missing locations, we are simply creating a dataframe with the missing locations set to NA (i.e., creating a regular grid of observations with some NAs). The idea is that we still want to separate tracks into segment when there are large data gaps, but place NAs when it’s just a few locations missing.

The first step is to create an adhabitat trajectory. Here, we simply input coordinates, time, and the ID (here track segment id).

track_ltraj <- as.ltraj(xy = track_split[, c("x", "y")], 
                                   date = track_split$time, 
                                   id = track_split$ID)

Now we set the NAs. As before, we use a time interval of 30 min. We specify that we have a tolerance for the imprecision in the data, here at most 15 min. We use the first location at the time reference.

# Create adehabitat object, containing the trajectory padded with NAs
track_NA <- setNA(ltraj = track_ltraj, 
                  date.ref = track_split$time[1], 
                  dt = 30, tol = 15, units = "min")

# Transform back to dataframe
track_NA <- ld(track_NA)[, c("id", "x", "y", "date")]
colnames(track_NA) <- c("ID", "x", "y", "time")
head(track_NA)
##          ID        x       y                time
## 1 T172062-1 -80.6525 72.2728 2017-08-08 00:20:00
## 2 T172062-1 -80.6692 72.2674 2017-08-08 00:52:00
## 3 T172062-1 -80.6592 72.2563 2017-08-08 01:29:00
## 4 T172062-1 -80.6558 72.2550 2017-08-08 02:04:00
## 5 T172062-1 -80.6316 72.2582 2017-08-08 02:26:00
## 6 T172062-1 -80.5892 72.2403 2017-08-08 03:02:00

We can see now that there are NAs for some locations, and that the time is not exactly at every 30 min, but close to. So this may not be adequate for some analysis that require strict time regularity. One could round to the nearest 30min, but this could be problematic.

4.1.6 Multiple Imputation

For many analyses where you need to extract covariates from the location data, it may be wise to use multiple imputation, where you create multiple tracks with a CTCRW model and extract the covariates for each of the tracks. Then you can propagate the uncertainty in other analyses.

Multiple imputation works by first fitting a CTCRW model to the original data, second, drawing (i.e., simulating) a number of realisations of the position process based on the CTCRW model, third (not done here), fitting your model of interest to each of the simulated realisations, and finally, pooling the estimated parameters. momentuHMM has several functions to implement multiple imputation. The function MIfitHMM can be used both to simulate realisations of a fitted CTCRW and fit hidden Markov model (HMMs) to each one. The number of simulations is specified with nSims. We can simulate realisations without fitting HMMs by setting fit = FALSE.

Here, let’s use first fit a CTCRW model on segmented tracks created above. We will simulate 4 tracks using MIfitHMM.

set.seed(12)

# Fit the correlated random walk, MIfitHMM takes a crwData object
crw_gps_30 <- crawlWrap(obsData = track_split_sf, timeStep = "30 mins")
## Fitting 20 track(s) using crawl::crwMLE...
## Individual T172062-1...
## Individual T172062-2...
## Individual T172062-3...
## Individual T172062-4...
## Individual T172062-5...
## Individual T172062-6...
## Individual T172062-7...
## Individual T172064-1...
## Individual T172064-2...
## Individual T172064-3...
## Individual T172064-4...
## Individual T172064-5...
## Individual T172064-6...
## Individual T172066-1...
## Individual T172066-2...
## Individual T172066-3...
## Individual T172066-4...
## Individual T172066-5...
## Individual T172066-6...
## Individual T172066-7...
## DONE
## Predicting locations (and uncertainty) at 30 mins time steps for 16 track(s) using crawl::crwPredict...
## DONE
# simulate 4 realisations of the 30 min GPS CTCRW model
MI_sim_gps <- MIfitHMM(crw_gps_30, nSims = 4, fit = FALSE)
## Drawing 4 realizations from the position process using crawl...
## DONE
## DONE

This will return warning messages. They are not shown here, but one should look at the segments that resulted in warning messages. This can potentially be fixed by forcing longer track segments above.

Let’s look at one segment: T172066-6.

ID_seg <- "T172066-6"
# plot locations for first narwhal
# filter first ID from original data
track_eg <- track_split_sf %>% 
  mutate(x = st_coordinates(track_split_sf)[,"X"], 
         y = st_coordinates(track_split_sf)[,"Y"]) %>% 
  filter(ID == ID_seg)

# filter first ID for each simulation
sim_tracks <- lapply(MI_sim_gps$miData, function(x){
  filter(x, ID == ID_seg)})

# plot original track for first narwhal
plot(track_eg$x, track_eg$y, col = "red", xlab = "x", ylab = "y", asp = 1, type = "l")

# plot each simulated track
mute <- mapply(function(data, col) {
                 points(y ~ x, data = data, col = col, type = "l")
               }, data = sim_tracks, 
               col = list("cadetblue1", "cadetblue2", "cadetblue3", "cadetblue4"), 
               SIMPLIFY = FALSE)

Notice how in some areas the different simulations have generally good agreement in the likely location during gaps, while in others they diverge. Multiple imputation can be particularly powerful if we want to incorporate environmental variables, as spatially explicit variables can be extracted for each simulated track to sample the most likely conditions encountered by the animal.